A Methodology to Retrieve Text Documents from Multiple Databases

Authors

  • Clement T. Yu
  • King-Lup Liu
  • Weiyi Meng
  • Zonghuan Wu
  • Naphtali Rishe
Abstract

This paper presents a methodology for finding the n most similar documents across multiple text databases for any given query and for any positive integer n. This methodology consists of two steps. First, the contents of databases are indicated approximately by database representatives. Databases are ranked using their representatives with respect to the given query. We provide a necessary and sufficient condition to rank the databases optimally. In order to satisfy this condition, we provide three estimation methods. One estimation method is intended for short queries; the other two are for all queries. Second, we provide an algorithm, OptDocRetrv, to retrieve documents from the databases according to their rank and in a particular way. We show that if the databases containing the n most similar documents for a given query are ranked ahead of other databases, our methodology will guarantee the retrieval of the n most similar documents for the query. When the number of databases is large, we propose to organize database representatives into a hierarchy and employ a best-search algorithm to search the hierarchy. It is shown that the effectiveness of the best-search algorithm is the same as that of evaluating the user query against all database representatives.
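The two-step idea in the abstract (rank databases by their representatives, then retrieve from them in rank order) can be illustrated with a minimal sketch. This is not the paper's OptDocRetrv algorithm, whose details the abstract does not give. It assumes, hypothetically, that each database has an estimated similarity for its most similar document (`estimates`), that this estimate is never lower than the true value, and that a `search` callback returns a database's best documents for the query. All names here are illustrative.

```python
import heapq
from typing import Callable, Dict, List, Tuple

Doc = Tuple[str, float]  # (document id, similarity to the query)


def retrieve_top_n(
    estimates: Dict[str, float],
    search: Callable[[str, int], List[Doc]],
    n: int,
) -> List[Doc]:
    """Visit databases in descending order of estimated best similarity.

    Assumption (not from the paper): estimates[db] is an upper bound on
    the similarity of any document in db, so once n documents are held
    and the n-th best is at least the next estimate, later databases
    cannot contribute and the loop can stop.
    """
    ranked = sorted(estimates, key=lambda db: estimates[db], reverse=True)
    best: List[Tuple[float, str]] = []  # min-heap of (similarity, doc id)
    for db in ranked:
        if len(best) == n and best[0][0] >= estimates[db]:
            break  # remaining databases cannot improve the current top n
        for doc_id, sim in search(db, n):
            if len(best) < n:
                heapq.heappush(best, (sim, doc_id))
            elif sim > best[0][0]:
                heapq.heapreplace(best, (sim, doc_id))
    # Return documents from most to least similar.
    return sorted(((d, s) for s, d in best), key=lambda p: p[1], reverse=True)
```

The early exit is what makes the ranking matter: if the databases holding the n most similar documents come first in the order, the lower-ranked databases are never contacted.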

Similar resources

A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results

Today, the importance of text processing and its uses is well known among researchers and students. The volume of textual material grows day by day, so we need effective ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the most similar ones to the user ...

Developing a Comprehensive Patent-Related Information Retrieval Tool

There is an explosive growth of regulatory and related information now available online. This paper reviews the current state of practice in accessing patent-related documents in the form of patents, government regulations, court cases, scientific publications, etc. This paper proposes an ontology-based framework to retrieve documents from the multiple heterogeneous databases. A use case, eryth...

Documents meet Databases: A System for Intranet Search

In enterprise intranets, information is encoded in documents and databases. Logically, the information in both worlds is tightly connected, however, on the system level there is usually a large gap. In this paper, we propose a system to retrieve documents in the enterprise intranet. The system is an extension to common text search. It does not only consider the content of documents but also it ...

Modeling Query-Based Access to Text Databases

Searchable text databases abound on the web. Applications that require access to such databases often resort to querying to extract relevant documents because of two main reasons. First, some text databases on the web are not “crawlable,” and hence the only way to retrieve their documents is via querying. Second, applications often require only a small fraction of a database’s contents, so retr...

Literature Review on Automatic Text Summarization: Single and Multiple Summarizations

An enormous amount of information is available online on the World Wide Web. Search engines such as Google and Yahoo were developed to retrieve information from databases, but as the volume of electronic information grows day by day, they have not delivered satisfactory results. Thus automatic summarization came into demand. Automatic summarization gathers several documents as input and provides the shorter summar...

Journal:
  • IEEE Trans. Knowl. Data Eng.

Volume 14

Publication date: 2002